A Sandhi Splitter for Malayalam
Abstract
Sandhi splitting is a primary task in the computational processing of text in Sanskrit and Dravidian languages. In these languages, words can join together, with morpho-phonemic changes at the point of joining. This phenomenon is known as sandhi. A sandhi splitter splits a string of conjoined words into individual words. Accurate sandhi splitting is crucial for text processing tasks such as POS tagging, topic modelling and document indexing. We tried different approaches to address the challenges of sandhi splitting in Malayalam, and ultimately chose to exploit the phonological changes that take place in words when they join. This resulted in a hybrid method that identifies split points statistically and then splits the string using predefined character-level linguistic rules. Currently, our system achieves an accuracy of 91.1%.
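The hybrid idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the romanized toy strings, the two-character context window, the threshold, and the `RULES` table are all assumptions made for illustration. A split probability is estimated for each character context from annotated examples, the most likely split point is selected, and a character-level rule then undoes the change at the boundary.

```python
from collections import Counter

# Hypothetical toy data: (joined string, index where the second word begins).
# Romanized strings for illustration only; not taken from the paper's corpus.
TRAIN = [
    ("mazhayilla", 6),  # left word ends before an inserted 'y'
    ("puzhayilla", 6),
    ("kandilla", 4),    # left word lost a final 'u'
    ("vannilla", 4),
]

# Hypothetical character-level rules keyed by the two characters around the
# split point: (characters to strip from the left part, suffix to restore).
RULES = {
    "yi": (1, ""),   # drop the inserted glide
    "di": (0, "u"),  # restore the elided vowel
    "ni": (0, "u"),
}


def train(pairs):
    """Estimate P(split | context) for every two-character context."""
    split_counts, total_counts = Counter(), Counter()
    for word, split_at in pairs:
        for i in range(1, len(word)):
            ctx = word[i - 1:i + 1]
            total_counts[ctx] += 1
            if i == split_at:
                split_counts[ctx] += 1
    return {ctx: split_counts[ctx] / total_counts[ctx] for ctx in total_counts}


def split_sandhi(word, probs, threshold=0.5):
    """Pick the most probable split point, then apply the matching rule."""
    best_i, best_p = None, threshold
    for i in range(1, len(word)):
        p = probs.get(word[i - 1:i + 1], 0.0)
        if p > best_p:
            best_i, best_p = i, p
    if best_i is None:
        return [word]  # no position is confident enough; leave the word whole
    strip, restore = RULES.get(word[best_i - 1:best_i + 1], (0, ""))
    left = word[:best_i - strip] + restore
    return [left, word[best_i:]]


probs = train(TRAIN)
print(split_sandhi("thalayilla", probs))
print(split_sandhi("pusthakam", probs))
```

In this sketch the statistical step only decides where to split, while the rule table decides how to restore the original word forms, which mirrors the division of labour the abstract describes. A real system would need a much larger annotated corpus, wider contexts, and rules written for Malayalam script rather than romanized text.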
Similar resources
Statistical Sandhi Splitter for Agglutinative Languages
Sandhi splitting is a primary and an important step for any natural language processing (NLP) application for languages which have agglutinative morphology. This paper presents a statistical approach to build a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language and the output is a split of that string into meaningful word/s. The approach adopte...
Statistical Sandhi Splitter and its Effect on NLP Applications
This paper revisits the work of (Kuncham et al., 2015) which developed a statistical sandhi splitter (SSS) for agglutinative languages that was tested for Telugu and Malayalam languages. Handling compound words is a major challenge for Natural Language Processing (NLP) applications for agglutinative languages. Hence, in this paper we concentrate on testing the effect of SSS on the NLP applicati...
Significance of an Accurate Sandhi-Splitter in Shallow Parsing of Dravidian Languages
This paper evaluates the challenges involved in shallow parsing of Dravidian languages which are highly agglutinative and morphologically rich. Text processing tasks in these languages are not trivial because multiple words concatenate to form a single string with morpho-phonemic changes at the point of concatenation. This phenomenon known as Sandhi, in turn complicates the individual word iden...
Design of Photonic Crystal Polarization Splitter on InP Substrate
In this article, we propose a novel design of a polarization splitter based on a coupler waveguide on an InP substrate at a wavelength of 1.55 µm. The photonic crystal structure consists of two-dimensional (2D) air holes embedded in InP/InGaAsP material with an effective refractive index of 3.2634, arranged in a hexagonal lattice. The photonic band gap (PBG) of this structure is determined using t...
Experimental Investigation of the Effect of Splitter Plate Angle on the Under-Scouring of Submarine Pipeline Due to Steady Current and Clear Water Condition
Submarine pipelines are appropriate method for transmission of oil and gas from sea bed. Free spans may occur due to the natural uneven seabed or by under-scouring. Vortex Induced Vibration (VIV) can happen in such free spans at high Reynolds number. Resonance occurs if the frequency of vortex shedding is close to the pipeline’s natural frequency leading to its fatigue that can break the pipeli...